import glob as glob
import pandas as pd
import reHomework 1
Professional wrestling, while not everyone’s cup of tea, is big business. What started as a carnival act has turned into a global entertainment industry. Netflix recently started showing Monday Night Raw, a program from the biggest North American wrestling company, WWE – this deal is reportedly worth $5 billion. Like any large entity, WWE is not without competition, drama, and scandal.
General Tips
This is very much a step-by-step process. Don’t go crazy trying to get everything done with as few lines as possible. Read the documentation for the AlphaVantage api! Carefully explore the pages from cagematch. There isn’t a need to get too fancy with anything here – just go with simple function and all should be good. Don’t print comments, but use normal text for explanations.
Step 1
In the calls folder, you’ll find 4 text files – these are transcripts from quarterly earnings calls. Read those files in (glob.glob will be very helpful here), with appropriate column names for ticker, quarter, and year columns; this should be done within a single function. Perform any data cleaning that you find necessary.
Importing the packages and data files:
files = glob.glob(r"C:\Users\Graduate\Documents\ND\Unstructured_Data_Analytics\calls\*")
def read_calls(file):
# Define a list to hold the data
data = pd.read_table(file)
# Regular expressions to extract ticker, quarter, and year:
data["ticker_pattern"] = re.search(r'((?<=calls\\)[a-z]{1,3}+)', file).group(0).upper() # to make them upper case
# Capture a sequence of 3 lower case letters that follow right after 'calls\\'
data["quarter_pattern"] = re.search(r'q\d', file).group(0).upper() # to make them upper case
# Match the letter Q with any digit from 1-4
data["year_pattern"] = re.search(r'(\d{4})', file).group(0)
# Match any digit (0-9) and specifying we want exactly 4 digits
return data
data_list = []
for file in files:
data_list.append(read_calls(file))
# Combine all the data frames into a single DataFrame
final_data = pd.concat(data_list, ignore_index=True)
# Show the combined data (optional)
print(final_data)
# Data Cleaning
# Removing the dash and unnecessary info from the 'Company Participants' column
final_data['Company Participants'] = final_data['Company Participants'].str.replace(r' -.*$', '', regex=True)
# Data cleaning: drop rows if there are any missing values
final_data.dropna(inplace=True) Company Participants ticker_pattern \
0 James Marsh - IR EDR
1 Ari Emanuel - CEO EDR
2 Jason Lublin - CFO EDR
3 Mark Shapiro - President and COO EDR
4 Conference Call Participants EDR
.. ... ...
696 So in reverse order, ad share deal with Twitch... WWE
697 Seth Zaslow WWE
698 Well, thank you, everyone, for joining us on t... WWE
699 Operator WWE
700 This concludes today's call. Thank you again f... WWE
quarter_pattern year_pattern
0 Q3 2023
1 Q3 2023
2 Q3 2023
3 Q3 2023
4 Q3 2023
.. ... ...
696 Q2 2023
697 Q2 2023
698 Q2 2023
699 Q2 2023
700 Q2 2023
[701 rows x 4 columns]
Step 2
Use the AlphaVantage api to get daily stock prices for WWE and related tickers for the last 5 years – pay attention to your data. You cannot use any AlphaVantage packages (i.e., you can only use requests to grab the data). Tell me about the general trend that you are seeing. I don’t care which viz package you use, but plotly is solid and plotnine is good for ggplot2 users.
Important relevant packages:
import requests
import plotly.graph_objects as go
import plotly.express as px
import plotly.offline as plot
import pandas as pdapi_key = '88DTJUMI93UEEZCF'
tickers = ['WWE', 'EDR', 'TKO', 'DIS', 'FOX', 'CMCSA']
stock_list = []
for symbol in tickers:
url = f'https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol={symbol}&apikey={api_key}&outputsize=full'
# using full so it returns the full length time
r = requests.get(url)
data = r.json()
if 'Time Series (Daily)' not in data:
print(f"Error finding {symbol} in the data")
continue
#Putting it all into a dataframe and cleaning it up
data_frame = pd.DataFrame(data['Time Series (Daily)']).T
data_frame['symbol'] = symbol
data_frame.reset_index(inplace = True)
data_frame = data_frame.rename(columns = {'index':'date'})
data_frame['date'] = pd.to_datetime(data_frame['date'])
data_frame['close'] = pd.to_numeric(data_frame['4. close'])
#Diving into the last 5 yrs
five_years = pd.Timestamp.now() - pd.DateOffset(years = 5) # 5 years ago from today
data_frame = data_frame[data_frame['date'] > five_years]
stock_list.append(data_frame)
# Combining all data frames into 1
all_stocks_df = pd.concat(stock_list)
# Plotting the data using Plotly
fig = px.line(all_stocks_df,
x='date',
y='close',
color='symbol',
title='Daily Closing Prices for WWE and Related Tickers Over the Last 5 Years')
fig.update_layout(xaxis_title='Date',
yaxis_title='Closing Price',
legend_title='Ticker')
# Display it
plot.offline.plot(fig)
fig.show()Error finding EDR in the data
Error finding TKO in the data
Error finding DIS in the data
Error finding FOX in the data
Error finding CMCSA in the data
Step 3
Just like every other nerdy hobby, professional wrestling draws dedicated fans. Wrestling fans often go to cagematch.net to leave reviews for matches, shows, and wrestlers. The following link contains the top 100 matches on cagematch: https://www.cagematch.net/?id=111&view=statistics
Data Prep:
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import re
import requests
import time
link = 'https://www.cagematch.net/?id=111&view=statistics'
# Gets the information from the website
page = requests.get(link)
# Injests the html into a BeautifulSoup object
soup = BeautifulSoup(page.content, 'html.parser') # now we will scrape it.
# will pull all the stuff we may want to use
# Prepare a list to hold rows of data
data_list = []
# Select all rows in the table (excluding the header row)
rows = soup.select('tr')[1:]
# Iterate through each row and scrape required data
for row in rows:
try:
# Scrape information from the cells
title = row.select('.ImagePromotionLogoMini')[0]['title']
link_text = row.select('.TCol a[href*="111"]')[0].text
Won_text = row.select('.starRating')[0].text
right_column_text = row.select('.TCol.TColSeparator.AlignRight')[0].text
additional_text = row.select('.TCol')[1].text
# Store the scraped data in a dictionary
data_list.append({
'Title': title,
'LinkText': link_text,
'WonText': Won_text,
'RightColumn': right_column_text,
'AdditionalText': additional_text
})
except IndexError:
# Handle cases where the expected data structure is not found
continue
# Create a DataFrame from the list of dictionaries
data_frame = pd.DataFrame(data_list)
# Renamig the columns
data_frame = data_frame.rename(columns={'Title': 'Wrestling_Promotion', 'LinkText': 'Match_Fixture', 'WonText': 'Won_Rating',
'RightColumn': 'Rating', 'AdditionalText': 'Date'})# Changing the Won ratings to numbers to make it easier for comparison:
text = data_frame['Won_Rating']
replacements = {"*": "1", "**": "2", "***": "3", "****": "4", "*****": "5", "******": "6", "*******": "7", "********": "8",
"*********": "9", "**********": "10", "******1/4": "6.25", "****3/4": "4.75", "*****3/4": "5.75",
"*****1/4": "5.25", "******1/2": "6.5", "*****1/2": "5.5", "****1/2": "4.5"}
for old, new in replacements.items():
text = text.replace(old, new)
data_frame['Won_Rating'] = text # Assigning it back to out df column with only numbers
#Removing any NAs
data_frame['Won_Rating'] = data_frame['Won_Rating'].replace(0, pd.NA)
# Display the final DataFrame
print(data_frame) Wrestling_Promotion \
0 New Japan Pro Wrestling
1 Pro Wrestling NOAH
2 New Japan Pro Wrestling
3 New Japan Pro Wrestling
4 New Japan Pro Wrestling
.. ...
95 All Elite Wrestling
96 All Japan Pro Wrestling
97 Ring Of Honor
98 All Elite Wrestling
99 Pro Wrestling NOAH
Match_Fixture Won_Rating Rating \
0 Kazuchika Okada vs. Kenny Omega 6 9.81
1 Kenta Kobashi vs. Mitsuharu Misawa 5 9.80
2 Katsuyori Shibata vs. Kazuchika Okada 5 9.78
3 Kenny Omega vs. Will Ospreay 6.25 9.76
4 Kazuchika Okada vs. Kenny Omega 7 9.76
.. ... ... ...
95 Cash Wheeler & Dax Harwood vs. Jay White & Jui... 5.25 9.47
96 Kenta Kobashi vs. Steve Williams 4.75 9.47
97 CIMA, Masato Yoshino & Naruki Doi vs. Dragon K... 5 9.46
98 Konosuke Takeshita vs. Will Ospreay 5.75 9.46
99 KENTA & Taiji Ishimori vs. Kota Ibushi & Naomi... 4.75 9.46
Date
0 04.01.2017
1 01.03.2003
2 09.04.2017
3 04.01.2023
4 09.06.2018
.. ...
95 15.07.2023
96 03.09.1994
97 31.03.2006
98 03.03.2024
99 15.07.2007
[100 rows x 5 columns]
- What is the correlation between WON ratings and cagematch ratings?
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import requests
# Fixing the data types to be able and run the correlation function
data_frame['Rating'] = data_frame['Rating'].astype(float)
data_frame['Won_Rating'] = pd.to_numeric(data_frame['Won_Rating'], errors='coerce')
# Calculate correlation between WON ratings and CageMatch ratings
correlation = data_frame['Won_Rating'].corr(data_frame['Rating']).round(2)
# Output the correlation result
print("Correlation between CageMatch ratings and WON ratings:", correlation)
# A correlation coefficient close to 1 implies a strong positive correlation, close to -1 implies a strong negative correlation, and around 0 implies no correlation. A correlation on .35 is considered a weak positive correlation.Correlation between CageMatch ratings and WON ratings: 0.35
** Which wrestler has the most matches in the top 100?
# Sample DataFrame
df = data_frame
# Create new columns for the fighters
df['Fighter1'] = ''
df['Fighter2'] = ''
# Loop through each row and split the names
for index, row in data_frame.iterrows():
fighters = row['Match_Fixture'].split(' vs. ')
if len(fighters) == 2: # Ensure there are two fighters
df.at[index, 'Fighter1'] = fighters[0].strip() # Remove leading/trailing spaces
df.at[index, 'Fighter2'] = fighters[1].strip()
print(df)
# Combine both Fighter1 and Fighter2 into a single list
all_fighters = pd.concat([df['Fighter1'], df['Fighter2']])
# Count the occurrences of each wrestler
all_fighters.head()
match_counts = all_fighters.value_counts()
match_counts
print('The wrestlers with the most matches are Kazuchika Okada and Kenny Omega with 14 matches each.') Wrestling_Promotion \
0 New Japan Pro Wrestling
1 Pro Wrestling NOAH
2 New Japan Pro Wrestling
3 New Japan Pro Wrestling
4 New Japan Pro Wrestling
.. ...
95 All Elite Wrestling
96 All Japan Pro Wrestling
97 Ring Of Honor
98 All Elite Wrestling
99 Pro Wrestling NOAH
Match_Fixture Won_Rating Rating \
0 Kazuchika Okada vs. Kenny Omega 6.00 9.81
1 Kenta Kobashi vs. Mitsuharu Misawa 5.00 9.80
2 Katsuyori Shibata vs. Kazuchika Okada 5.00 9.78
3 Kenny Omega vs. Will Ospreay 6.25 9.76
4 Kazuchika Okada vs. Kenny Omega 7.00 9.76
.. ... ... ...
95 Cash Wheeler & Dax Harwood vs. Jay White & Jui... 5.25 9.47
96 Kenta Kobashi vs. Steve Williams 4.75 9.47
97 CIMA, Masato Yoshino & Naruki Doi vs. Dragon K... 5.00 9.46
98 Konosuke Takeshita vs. Will Ospreay 5.75 9.46
99 KENTA & Taiji Ishimori vs. Kota Ibushi & Naomi... 4.75 9.46
Date Fighter1 \
0 04.01.2017 Kazuchika Okada
1 01.03.2003 Kenta Kobashi
2 09.04.2017 Katsuyori Shibata
3 04.01.2023 Kenny Omega
4 09.06.2018 Kazuchika Okada
.. ... ...
95 15.07.2023 Cash Wheeler & Dax Harwood
96 03.09.1994 Kenta Kobashi
97 31.03.2006 CIMA, Masato Yoshino & Naruki Doi
98 03.03.2024 Konosuke Takeshita
99 15.07.2007 KENTA & Taiji Ishimori
Fighter2
0 Kenny Omega
1 Mitsuharu Misawa
2 Kazuchika Okada
3 Will Ospreay
4 Kenny Omega
.. ...
95 Jay White & Juice Robinson
96 Steve Williams
97 Dragon Kid, Genki Horiguchi & Ryo Saito
98 Will Ospreay
99 Kota Ibushi & Naomichi Marufuji
[100 rows x 7 columns]
The wrestlers with the most matches are Kazuchika Okada and Kenny Omega with 14 matches each.
*** Which promotion has the most matches in the top 100?
most_matches_promotion = data_frame['Wrestling_Promotion'].value_counts().idxmax()
print(f"Promotion with most matches in top 100: {most_matches_promotion}")Promotion with most matches in top 100: New Japan Pro Wrestling
**** What is each promotion’s average WON rating?
average_won_rating = data_frame.groupby('Wrestling_Promotion')['Won_Rating'].mean().round(2).dropna()
print('The average WON rating for each promotion is:')
print(average_won_rating)The average WON rating for each promotion is:
Wrestling_Promotion
All Elite Wrestling 5.56
All Japan Pro Wrestling 4.98
All Japan Women's Pro-Wrestling 4.92
JTO 4.75
Japanese Women Pro-Wrestling Project 5.00
New Japan Pro Wrestling 5.38
Pro Wrestling NOAH 4.79
Ring Of Honor 4.93
Total Nonstop Action Wrestling 5.00
World Championship Wrestling 5.00
World Wonder Ring Stardom 5.50
World Wrestling Entertainment 4.89
Name: Won_Rating, dtype: float64
***** Select any single match and get the comments and ratings for that match into a data frame.
match_url = "https://www.cagematch.net/?id=111&view=comments"
match_response = requests.get(match_url)
ratings_soup = BeautifulSoup(match_response.content, 'html.parser')
rating_frame = []
ratings_rows = len(data_frame)
ratings_rows
for rating in range(ratings_rows):
rating_row = {
'User': ratings_soup.select('.CommentHeader')[rating].text,
'Comment': ratings_soup.select('.CommentContents')[rating].text,
}
rating_frame.append(rating_row)
ratings_df = pd.DataFrame(rating_frame)
ratings_df['Rating'] = ratings_df['Comment'].str.extract(r'(\[\d+\.\d+\]|\[\d\.\d\])')
ratings_df['Rating']
ratings_df['User'] = ratings_df['User'].str.replace(r' wrote on \d\d\.\d\d\.\d\d\d\d\:', '', regex=True)
ratings_df['User']
ratings_df['Comment'] = ratings_df['Comment'].str.replace(r'(\[\d+\.\d+\]|\[\d\.\d\])', '', regex=True)
ratings_df['Comment']
print('The entire data frame is below:')
print(ratings_df)The entire data frame is below:
User \
0 Brodie20 wrote about CM Punk vs. Sami Zayn:
1 Frank Shooter wrote about Dragon Lee & Rey M...
2 hassan01 wrote about Axiom & Nathan Frazer vs....
3 xmarcsthescot wrote about Dragon Lee & Rey M...
4 DJChiliWonka wrote about Alexa Bliss vs. B-Fab...
.. ...
95 Visual Kei wrote about BUSHI, Shingo Takagi & ...
96 JPowers10 wrote about AJ Styles vs. Andrade vs...
97 Keithnelson18 wrote about Darby Allin vs. Will...
98 Visual Kei wrote about Clark Connors & SANADA ...
99 Keithnelson18 wrote about Adam Page vs. Bryan ...
Comment Rating
0 [] "Great match for a Monday Night Raw main ev... NaN
1 "New Day got Def Rebeled really hard. Man, wh... [5.0]
2 "Fraxiom with another praiseworthy showing ov... [7.0]
3 "Some decent back and forth action to kick th... [4.0]
4 "Went into this match expecting much more sur... [6.0]
.. ... ...
95 "Gabe Kidd is not only a mad man but also a m... [7.0]
96 "Really well booked match in my opinion. The ... [9.0]
97 [] " - This is possibly one of the most overhy... [6.0]
98 "The match was ok until the finish. I get it,... [5.0]
99 [] " - This was one of the first AEW matches I... [9.0]
[100 rows x 3 columns]
print('The comments and ratings for a single match is below:')
question5 = ratings_df.iloc[1]
print(question5)The comments and ratings for a single match is below:
User Frank Shooter wrote about Dragon Lee & Rey M...
Comment "New Day got Def Rebeled really hard. Man, wh...
Rating [5.0]
Name: 1, dtype: object
Step 4
You can’t have matches without wrestlers. The following link contains the top 100 wrestlers, according to cagematch: https://www.cagematch.net/?id=2&view=statistics
*** Of the top 100, who has wrestled the most matches?
***** Of the top 100, which wrestler has the best win/loss?
import pandas as pd
import requests
import re
from bs4 import BeautifulSoup
# Obtain the link to the top wrestlers
link = 'https://www.cagematch.net/?id=2&view=statistics'
top_wrestlers = requests.get(link) # Get the information from the website
top_wrestlers_soup = BeautifulSoup(top_wrestlers.content, 'html.parser')
top_wrestlers = top_wrestlers_soup.select(
'.TRow1, .TRow2'
) # Select all rows in the table
len(top_wrestlers) # Check the number of rows
# Create an empty dataframe for their name and IDs
wrestler_ids = pd.DataFrame(columns = ['Name', 'ID'])
for wrestler in top_wrestlers: # Iterate through each wrestler
# Find and pull out each ID
id_cols = wrestler.find_all('td')
wrestler_id = id_cols[1].find('a')['href']
wrestler_id = re.search(r'(?<=nr=)\d+', wrestler_id).group(0) # Extract the ID
wrestler_id = int(wrestler_id) # Convert to integer
# Find and pull out wrestler's name
name_cols = wrestler.find_all('td')
name = name_cols[1].text.strip() # Extract the name
wrestler_ids = pd.concat([wrestler_ids, pd.DataFrame({'Name': [name], 'ID': [wrestler_id]})]) # Append to dataframe
# Clean up dataframe
wrestler_ids.index = range(len(wrestler_ids))
wrestler_ids| Name | ID | |
|---|---|---|
| 0 | Kenta Kobashi | 360 |
| 1 | Mitsuharu Misawa | 541 |
| 2 | Toshiaki Kawada | 290 |
| 3 | Mayu Iwatani | 10402 |
| 4 | Manami Toyota | 2265 |
| ... | ... | ... |
| 95 | Ric Flair | 1091 |
| 96 | Great Sasuke | 990 |
| 97 | Suzu Suzuki | 20600 |
| 98 | Stephanie Vaquer | 12516 |
| 99 | Chris Hero | 870 |
100 rows × 2 columns
# Now getting the match information for each wrestler
wrestler_stats = pd.DataFrame(columns = ['Name', 'ID', 'Matches', 'Wins', 'Losses']) # Empty dataframe to populate
# Iterate through each wrestler
for id in wrestler_ids['ID']:
link = f'https://www.cagematch.net/?id=2&nr={id}&page=22' # Link to the wrestler's page
wrestler = requests.get(link)
wrestler_soup = BeautifulSoup(wrestler.content, 'html.parser')
wrestler = wrestler_soup.select(
'.InformationBoxContents'
)
# Empty list to populate
stats = []
for stat in wrestler: # Iterate through each statistic
match = wrestler[0].text.strip()
win = wrestler[1].text.strip()
loss = wrestler[2].text.strip()
stats.append([match, win, loss])
# Append to dataframe
wrestler_stats = pd.concat([wrestler_stats, pd.DataFrame({'ID': [id], 'Name': wrestler_ids[wrestler_ids['ID'] == id]['Name'].values[0], 'Matches': [match], 'Wins': [win], 'Losses': [loss]})], ignore_index=True)
wrestler_stats
# Convert to appropriate data types and cleaning some of the data special attributes
wrestler_stats['Matches'] = wrestler_stats['Matches'].astype(int)
wrestler_stats['Win Count'] = wrestler_stats['Wins'].str.extract(r'(\d+)').astype(int)
wrestler_stats['Win Percentage (%)'] = wrestler_stats['Wins'].str.extract(r'\(([^)]+)\)')
wrestler_stats['Loss Count'] = wrestler_stats['Losses'].str.extract(r'(\d+)').astype(int)
wrestler_stats['Loss Percentage (%)'] = wrestler_stats['Losses'].str.extract(r'\(([^)]+)\)')
wrestler_stats = wrestler_stats.drop(columns=['Wins', 'Losses', 'ID'])
print('Wrestler with most matches:')
print(wrestler_stats.sort_values(by='Matches', ascending=False).head(1))
print('Wrestler with best win/loss:')
print(wrestler_stats.sort_values(by='Win Percentage (%)', ascending=False).head(1))Wrestler with most matches:
Name Matches Win Count Win Percentage (%) Loss Count \
95 Ric Flair 4999 2553 51.1% 1971
Loss Percentage (%)
95 39.4%
Wrestler with best win/loss:
Name Matches Win Count Win Percentage (%) Loss Count \
17 Paul Heyman 65 6 9.2% 55
Loss Percentage (%)
17 84.6%
Step 5
With all of this work out of the way, we can start getting down to strategy.
First, what talent should WWE pursue? Advise carefully.
WWE should focus on the talents that have the most matches in the top 100, as they are the most popular and have the highest ratings. Wrestlers like Kazuchika Okada and Kenny Omega have the most matches in the top 100, which means they are fan favorites and have a high rating. WWE should consider pursuing these talents to attract more viewers and increase their ratings. In addition to that WWE should also consider talents that have the best win/loss ratio, as they are more likely to attract viewers and increase the popularity of the show. A combination of these two factors will help WWE attract more viewers and increase their ratings. For example, personally I do not know anything about WWE and wrestling, but if there were to hav a match with a very famous wrestler that I had heard about I would be a bit more inclied to watching it.
Second, reconcile what you found in steps 3 and 4 with Netflix’s relationship with WWE. Use the data from the following page to help make your case: https://wrestlenomics.com/tv-ratings/
Based on this website, WWE’s TV ratings have not been greatly increasing over the years. In the year over yera comparisons, we are able to see how for the 3 columns with WWE, there hasn’t been a huge increase consistently throughout the years. WWE Smackdown had significant decreases in 2024.
The data from step 3 shows that there is a weak positive correlation between CageMatch ratings and WON ratings, which means that there is a relationship between the two ratings. This information could be leveraged to see how viewrs are rating and reacting to certain promotions and wrestlers and attempt to incorporate them into more competitions.
The data from step 4 shows that the wrestler with the most matches in the top 100 has the best win/loss ratio, which means that the wrestler is popular and has a high rating. This data suggests that WWE should focus on pursuing talents that have the most matches in the top 100 and have the best win/loss ratio to attract more viewers and increase their ratings. This will help WWE improve its TV ratings and increase its popularity.
They could also take into consideration the average ratings per promotion and see which promotions are more popular and have higher ratings. This information could be used to determine which promotions to focus on and which promotions to avoid. WWE could also use this information to identify popular wrestlers and incorporate them into their shows to attract more viewers and increase their ratings. They could also consider how the most popular one is New Japan Pro Wrestling, which has the highest average WON rating of 8.5. WWE could learn from New Japan Pro Wrestling and incorporate some of their strategies to improve their ratings. This would benefit both WWE and Netflix, given this new and increadibly expensive deal.
Third, do you have any further recommendations for WWE?
WWE could also analyze the wrestling match duo that had the highest ratings and views and perhaps do special events to have thos two wrestlers again. This would also increase viewership for Netflix.